Supermat Grocery Sales Analysis Project¶
STEP-1 import important libraries for analyze the data and visualize, cleaning and processing the data¶
In [ ]:
In [1]:
import pandas as pd # for data cleaning and processing the data
import numpy as np # for linear algebra
import matplotlib.pyplot as plt #for visualization
import seaborn as sns
import warnings
warnings.filterwarnings('ignore') # this for is igore the warning
STEP-2 load the dataset¶
import data in csv¶
In [2]:
df=pd.read_csv("Supermat_Grocery_Sales_dataset.csv")
In [3]:
df.head()
Out[3]:
| Order ID | Customer Name | Category | Sub Category | City | Order Date | Region | Sales | Discount | Profit | State | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | OD1 | Harish | Oil & Masala | Masalas | Vellore | 11-08-2017 | North | 1254.0 | 0.12 | 401.28 | Tamil Nadu |
| 1 | OD2 | Sudha | Beverages | Health Drinks | Krishnagiri | 11-08-2017 | South | 749.0 | 0.18 | 149.80 | Tamil Nadu |
| 2 | OD3 | Hussain | Food Grains | Atta & Flour | Perambalur | 06-12-2017 | West | 2360.0 | 0.21 | 165.20 | Tamil Nadu |
| 3 | OD4 | Jackson | Fruits & Veggies | Fresh Vegetables | Dharmapuri | 10-11-2016 | South | 896.0 | 0.25 | 89.60 | Tamil Nadu |
| 4 | OD5 | Ridhesh | Food Grains | Organic Staples | Ooty | 10-11-2016 | South | 2355.0 | 0.26 | 918.45 | Tamil Nadu |
In [4]:
df.columns
Out[4]:
Index(['Order ID', 'Customer Name', 'Category', 'Sub Category', 'City',
'Order Date', 'Region', 'Sales', 'Discount', 'Profit', 'State'],
dtype='object')
In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 9994 entries, 0 to 9993 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Order ID 9994 non-null object 1 Customer Name 9994 non-null object 2 Category 9994 non-null object 3 Sub Category 8504 non-null object 4 City 8504 non-null object 5 Order Date 8504 non-null object 6 Region 8504 non-null object 7 Sales 8504 non-null float64 8 Discount 8504 non-null float64 9 Profit 8504 non-null float64 10 State 8504 non-null object dtypes: float64(3), object(8) memory usage: 859.0+ KB
In [6]:
df.shape # this function is show the row and column
Out[6]:
(9994, 11)
In [7]:
df.duplicated().sum() # this duplicated function is show the duplicated value
Out[7]:
0
In [8]:
df.isnull().sum().sum() # this show tthe null value in data
Out[8]:
11920
In [9]:
(df.isnull().sum().sum()/df.shape[0])*100
Out[9]:
119.27156293776267
In [10]:
df.isnull().sum()
Out[10]:
Order ID 0 Customer Name 0 Category 0 Sub Category 1490 City 1490 Order Date 1490 Region 1490 Sales 1490 Discount 1490 Profit 1490 State 1490 dtype: int64
In [11]:
(df.isnull().sum()/df.shape[0])*100
Out[11]:
Order ID 0.000000 Customer Name 0.000000 Category 0.000000 Sub Category 14.908945 City 14.908945 Order Date 14.908945 Region 14.908945 Sales 14.908945 Discount 14.908945 Profit 14.908945 State 14.908945 dtype: float64
In [12]:
df.dropna(inplace=True) #drop function is delete the null row
In [13]:
df.shape[0]
Out[13]:
8504
After cleaning¶
In [14]:
(df.isnull().sum()/df.shape[0])*100
Out[14]:
Order ID 0.0 Customer Name 0.0 Category 0.0 Sub Category 0.0 City 0.0 Order Date 0.0 Region 0.0 Sales 0.0 Discount 0.0 Profit 0.0 State 0.0 dtype: float64
In [15]:
df.info() # info function is show the all column datatype and not null valu
<class 'pandas.core.frame.DataFrame'> Index: 8504 entries, 0 to 9993 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Order ID 8504 non-null object 1 Customer Name 8504 non-null object 2 Category 8504 non-null object 3 Sub Category 8504 non-null object 4 City 8504 non-null object 5 Order Date 8504 non-null object 6 Region 8504 non-null object 7 Sales 8504 non-null float64 8 Discount 8504 non-null float64 9 Profit 8504 non-null float64 10 State 8504 non-null object dtypes: float64(3), object(8) memory usage: 797.2+ KB
Convert 'Order Date' to datetime format¶
Extract day, month, and year from 'Order Date'¶
In [16]:
df['Order Date']=pd.to_datetime(df['Order Date'],format='mixed')
In [17]:
df['Order Date'].head()
Out[17]:
0 2017-11-08 1 2017-11-08 2 2017-06-12 3 2016-10-11 4 2016-10-11 Name: Order Date, dtype: datetime64[ns]
After convert datetime¶
In [18]:
df['Order Year']=df['Order Date'].dt.year
df['Order month']=df['Order Date'].dt.month
df['Order day']=df['Order Date'].dt.day
df.head(3)
Out[18]:
| Order ID | Customer Name | Category | Sub Category | City | Order Date | Region | Sales | Discount | Profit | State | Order Year | Order month | Order day | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | OD1 | Harish | Oil & Masala | Masalas | Vellore | 2017-11-08 | North | 1254.0 | 0.12 | 401.28 | Tamil Nadu | 2017 | 11 | 8 |
| 1 | OD2 | Sudha | Beverages | Health Drinks | Krishnagiri | 2017-11-08 | South | 749.0 | 0.18 | 149.80 | Tamil Nadu | 2017 | 11 | 8 |
| 2 | OD3 | Hussain | Food Grains | Atta & Flour | Perambalur | 2017-06-12 | West | 2360.0 | 0.21 | 165.20 | Tamil Nadu | 2017 | 6 | 12 |
In [19]:
df.info()
<class 'pandas.core.frame.DataFrame'> Index: 8504 entries, 0 to 9993 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Order ID 8504 non-null object 1 Customer Name 8504 non-null object 2 Category 8504 non-null object 3 Sub Category 8504 non-null object 4 City 8504 non-null object 5 Order Date 8504 non-null datetime64[ns] 6 Region 8504 non-null object 7 Sales 8504 non-null float64 8 Discount 8504 non-null float64 9 Profit 8504 non-null float64 10 State 8504 non-null object 11 Order Year 8504 non-null int32 12 Order month 8504 non-null int32 13 Order day 8504 non-null int32 dtypes: datetime64[ns](1), float64(3), int32(3), object(7) memory usage: 896.9+ KB
In [20]:
df.describe() # describe functionn is describe the all numerical column
Out[20]:
| Order Date | Sales | Discount | Profit | Order Year | Order month | Order day | |
|---|---|---|---|---|---|---|---|
| count | 8504 | 8504.000000 | 8504.000000 | 8504.000000 | 8504.000000 | 8504.000000 | 8504.000000 |
| mean | 2017-04-30 20:05:48.824082688 | 1492.189675 | 0.226646 | 373.913921 | 2016.724835 | 7.797037 | 15.518109 |
| min | 2015-01-03 00:00:00 | 500.000000 | 0.100000 | 25.250000 | 2015.000000 | 1.000000 | 1.000000 |
| 25% | 2016-05-24 00:00:00 | 992.000000 | 0.160000 | 180.442500 | 2016.000000 | 5.000000 | 8.000000 |
| 50% | 2017-06-26 00:00:00 | 1494.000000 | 0.230000 | 318.615000 | 2017.000000 | 9.000000 | 16.000000 |
| 75% | 2018-05-14 00:00:00 | 1989.250000 | 0.290000 | 522.727500 | 2018.000000 | 11.000000 | 23.000000 |
| max | 2018-12-30 00:00:00 | 2500.000000 | 0.350000 | 1120.950000 | 2018.000000 | 12.000000 | 31.000000 |
| std | NaN | 578.301393 | 0.075038 | 239.218627 | 1.120837 | 3.290614 | 8.749222 |
In [21]:
df.drop('Order Date',axis=1,inplace=True) ## this fumnction is drop the columns
In [22]:
df.head()
Out[22]:
| Order ID | Customer Name | Category | Sub Category | City | Region | Sales | Discount | Profit | State | Order Year | Order month | Order day | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | OD1 | Harish | Oil & Masala | Masalas | Vellore | North | 1254.0 | 0.12 | 401.28 | Tamil Nadu | 2017 | 11 | 8 |
| 1 | OD2 | Sudha | Beverages | Health Drinks | Krishnagiri | South | 749.0 | 0.18 | 149.80 | Tamil Nadu | 2017 | 11 | 8 |
| 2 | OD3 | Hussain | Food Grains | Atta & Flour | Perambalur | West | 2360.0 | 0.21 | 165.20 | Tamil Nadu | 2017 | 6 | 12 |
| 3 | OD4 | Jackson | Fruits & Veggies | Fresh Vegetables | Dharmapuri | South | 896.0 | 0.25 | 89.60 | Tamil Nadu | 2016 | 10 | 11 |
| 4 | OD5 | Ridhesh | Food Grains | Organic Staples | Ooty | South | 2355.0 | 0.26 | 918.45 | Tamil Nadu | 2016 | 10 | 11 |
In [46]:
data=df
In [47]:
data = pd.DataFrame(df)
In [49]:
data.to_csv("df.csv", index=False)
-> region_sales , region_profit , region by discount¶
In [ ]:
In [ ]:
In [23]:
# Aggregate data by region
region_sales = df.groupby('Region')['Sales'].sum().reset_index()
region_profit = df.groupby('Region')['Profit'].sum().reset_index()
region_discount = df.groupby('Region')['Discount'].mean().reset_index()
# Merge all metrics into one dataframe
region_summary = region_sales.merge(region_profit, on='Region')
region_summary = region_summary.merge(region_discount, on='Region')
# Rename columns
region_summary.columns = ['Region', 'Total Sales', 'Total Profit', 'Average Discount']
# Display the results
region_summary
Out[23]:
| Region | Total Sales | Total Profit | Average Discount | |
|---|---|---|---|---|
| 0 | Central | 2951514.0 | 730219.49 | 0.228122 |
| 1 | East | 3570538.0 | 909880.50 | 0.227015 |
| 2 | North | 1254.0 | 401.28 | 0.120000 |
| 3 | South | 2079498.0 | 529834.71 | 0.226190 |
| 4 | West | 4086777.0 | 1009428.00 | 0.225514 |
-> Data is neat and clean and its ready to EDA & visualization then find the insight in Grocessary sales analysis dataset¶
step-4 EDA & Visualization¶
In [24]:
sns.set_theme(style="whitegrid")
# Create figure
plt.figure(figsize=(6, 5))
# Plot value counts with color palette
ax = df['Category'].value_counts().plot(kind='bar',
color=sns.color_palette("viridis", len(df['Category'].unique())),
edgecolor='black')
# Add labels and title
plt.title("Total Number of Categories", fontsize=14, fontweight='bold')
plt.xlabel("Category", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.xticks(rotation=45, ha='right')
# Add labels to each bar
for container in ax.containers:
ax.bar_label(container, fmt='%d', fontsize=10, padding=3)
# Show the plot
plt.show()
sub-category¶
In [25]:
sns.set_theme(style="whitegrid")
# Create figure
plt.figure(figsize=(6, 5))
# Plot value counts with color palette
ax = df['Sub Category'].value_counts().plot(kind='line' )
# Add labels and title
plt.title("Total Number of Sub Categories", fontsize=14, fontweight='bold')
plt.xlabel("Sub Category", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.xticks(rotation=45, ha='right')
# Show the plot
plt.show()
In [26]:
df.columns.values
Out[26]:
array(['Order ID', 'Customer Name', 'Category', 'Sub Category', 'City',
'Region', 'Sales', 'Discount', 'Profit', 'State', 'Order Year',
'Order month', 'Order day'], dtype=object)
-> Region by sales¶
In [27]:
# Aggregate sales data by region
region_sales = df.groupby('Region')['Sales'].sum().reset_index()
# Visualization - Sales by Region
plt.figure(figsize=(10, 5))
sns.barplot(x='Region', y='Sales', data=region_sales, palette='Blues')
# Add text labels on bars
for index, row in region_sales.iterrows():
plt.text(index, row['Sales'], f'{row["Sales"]:.2f}', ha='center', va='bottom')
# Set labels and title
plt.xlabel("Region")
plt.ylabel("Total Sales")
plt.title("Total Sales by Region")
plt.xticks(rotation=45)
# Show plot
plt.show()
# Display the results
region_sales
Out[27]:
| Region | Sales | |
|---|---|---|
| 0 | Central | 2951514.0 |
| 1 | East | 3570538.0 |
| 2 | North | 1254.0 |
| 3 | South | 2079498.0 |
| 4 | West | 4086777.0 |
-> Region by Profit¶
In [28]:
# Visualization - Profit by Region
plt.figure(figsize=(10, 5))
sns.barplot(x='Region', y='Profit', data=region_profit, palette='Greens')
# Add text labels on bars
for index, row in region_profit.iterrows():
plt.text(index, row['Profit'],f'{row["Profit"]:.2f}', ha='center', va='bottom')
# Set labels and title
plt.xlabel("Region")
plt.ylabel("Total Profit")
plt.title("Total Profit by Region")
plt.xticks(rotation=45)
# Show plot
plt.show()
# Display the results
region_profit
Out[28]:
| Region | Profit | |
|---|---|---|
| 0 | Central | 730219.49 |
| 1 | East | 909880.50 |
| 2 | North | 401.28 |
| 3 | South | 529834.71 |
| 4 | West | 1009428.00 |
Region by discount¶
In [29]:
region_Discountt = df.groupby('Region')['Discount'].mean().reset_index()
region_Discountt
Out[29]:
| Region | Discount | |
|---|---|---|
| 0 | Central | 0.228122 |
| 1 | East | 0.227015 |
| 2 | North | 0.120000 |
| 3 | South | 0.226190 |
| 4 | West | 0.225514 |
In [30]:
# Aggregate discount data by region
region_discount = df.groupby('Region')['Discount'].mean().reset_index()
# Visualization - Discount by Region
plt.figure(figsize=(10, 5))
sns.barplot(x='Region', y='Discount', data=region_discount, palette='Reds')
# Add text labels on bars
for index, discount in enumerate(region_discount['Discount']):
plt.text(index, discount, f'{discount:.2f}', ha='center', va='bottom')
# Set labels and title
plt.xlabel("Region")
plt.ylabel("Average Discount")
plt.title("Average Discount by Region")
plt.xticks(rotation=45)
# Show plot
plt.show()
# Display the results
print(region_discount)
Region Discount 0 Central 0.228122 1 East 0.227015 2 North 0.120000 3 South 0.226190 4 West 0.225514
In [ ]:
-> Region by sales ,profit and discount¶
In [ ]:
In [ ]:
In [31]:
# Aggregate data by region
region_summary = df.groupby('Region').agg({'Sales': 'sum', 'Profit': 'sum','Discount':'mean'}).reset_index()
# Rename columns
region_summary.columns = ['Region', 'Total Sales', 'Total Profit', 'Average Discount']
# Visualization - Merging all metrics into one graph
plt.figure(figsize=(19, 6))
# Set position for bars
bar_width = 0.3
positions = range(len(region_summary['Region']))
# Create bars for each metric
bars1 = plt.bar(positions, region_summary['Total Sales'], width=bar_width, label='Total Sales', color='blue', alpha=0.7)
bars2 = plt.bar([p + bar_width for p in positions], region_summary['Total Profit'], width=bar_width, label='Total Profit', color='green', alpha=0.7)
bars3 = plt.bar([p + 2 * bar_width for p in positions], region_summary['Average Discount'], width=bar_width, label='Average Discount', color='red', alpha=0.8)
# Add text labels on bars
def add_labels(bars):
for bar in bars:
height = bar.get_height()
plt.text(bar.get_x() + bar.get_width() / 2, height, f'{height:.2f}', ha='center', va='bottom')
add_labels(bars1)
add_labels(bars2)
add_labels(bars3)
# Set labels and title
plt.xlabel("Region")
plt.ylabel("Values")
plt.title("Sales, Profit, and Discount by Region",fontsize=14, fontweight='bold')
plt.xticks([p + bar_width for p in positions], region_summary['Region'], rotation=45)
plt.legend()
# Show plot
plt.show()
# Display the results
region_summary
Out[31]:
| Region | Total Sales | Total Profit | Average Discount | |
|---|---|---|---|---|
| 0 | Central | 2951514.0 | 730219.49 | 0.228122 |
| 1 | East | 3570538.0 | 909880.50 | 0.227015 |
| 2 | North | 1254.0 | 401.28 | 0.120000 |
| 3 | South | 2079498.0 | 529834.71 | 0.226190 |
| 4 | West | 4086777.0 | 1009428.00 | 0.225514 |
West region leads in both sales (4M+) and profit (1M+), indicating strong market performance despite offering similar discounts as other high-performing regions.¶
North region has extremely low sales (1,254) and profit (401), possibly due to its significantly lower discount (0.12), affecting customer attraction and revenue.¶
Central, East, and South regions maintain a balance with moderate sales, profit, and uniform discounts (0.23), suggesting a consistent pricing strategy.¶
Total Profit and Sales by State¶
In [32]:
# Group data by 'State' and sum 'Sales' and 'Profit'
state_summary = df.groupby('State').agg({'Sales': 'sum', 'Profit': 'sum','Discount':'mean'}).reset_index()
#rename column
state_summary.columns=['State','Total Sales','Total Profit','Average Discount']
state_summary
Out[32]:
| State | Total Sales | Total Profit | Average Discount | |
|---|---|---|---|---|
| 0 | Tamil Nadu | 12689581.0 | 3179763.98 | 0.226646 |
Total Profit and Total Sales by Category¶
In [33]:
# Group data by 'Category' and sum 'Sales' and 'Profit'
Category_summary = df.groupby('Category').agg({'Sales': 'sum', 'Profit': 'sum','Discount':'mean'}).reset_index()
#rename the column
Category_summary.columns=['Category','Total Sales','Total Profit','Total Discount']
Category_summary
Out[33]:
| Category | Total Sales | Total Profit | Total Discount | |
|---|---|---|---|---|
| 0 | Bakery | 2112281.0 | 528521.06 | 0.225428 |
| 1 | Beverages | 2085313.0 | 525605.76 | 0.230329 |
| 2 | Food Grains | 2115272.0 | 529162.64 | 0.228548 |
| 3 | Fruits & Veggies | 2100727.0 | 530400.38 | 0.229295 |
| 4 | Oil & Masala | 2038442.0 | 497895.29 | 0.224526 |
| 5 | Snacks | 2237546.0 | 568178.85 | 0.222048 |
In [34]:
# Aggregate sales and profit by category
category_summary = df.groupby("Category")[["Sales", "Profit"]].sum().reset_index()
# Plot the bar chart
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(category_summary["Category"], category_summary["Sales"], marker='o', linestyle='-', color="blue", label="Sales")
ax.plot(category_summary["Category"], category_summary["Profit"], marker='o', linestyle='-', color="green", label="Profit")
#category_summary.plot(kind="line", x="Category", y=["Sales", "Profit"], ax=ax, color=["blue", "green"])
# Customize the plot
plt.title("Total Sales and Profit by Category")
plt.ylabel("Amount (in currency units)")
plt.xlabel("Category")
plt.xticks(rotation=45)
plt.legend(["Sales", "Profit"])
plt.grid(axis="y", linestyle="--", alpha=0.7)
# Show the plot
plt.show()
In [35]:
# Aggregate data by Category
Category_summary = df.groupby('Category').agg({'Sales': 'sum', 'Profit': 'sum','Discount':'mean'}).reset_index()
# Rename columns
Category_summary.columns = ['Category', 'Total Sales', 'Total Profit', 'Average Discount']
# Visualization - Merging all metrics into one graph
plt.figure(figsize=(19, 6))
# Set position for bars
bar_width = 0.3
positions = range(len(Category_summary['Category']))
# Create bars for each metric
bars1 = plt.bar(positions, Category_summary['Total Sales'], width=bar_width, label='Total Sales', color='blue', alpha=0.7)
bars2 = plt.bar([p + bar_width for p in positions], Category_summary['Total Profit'], width=bar_width, label='Total Profit', color='green', alpha=0.7)
bars3 = plt.bar([p + 2 * bar_width for p in positions],Category_summary['Average Discount'], width=bar_width, label='Average Discount', color='red', alpha=0.8)
# Add text labels on bars
def add_labels(bars):
for bar in bars:
height = bar.get_height()
plt.text(bar.get_x() + bar.get_width() / 2, height, f'{height:.2f}', ha='center', va='bottom')
add_labels(bars1)
add_labels(bars2)
add_labels(bars3)
# Set labels and title
plt.xlabel("Category")
plt.ylabel("Values")
plt.title("Sales, Profit, and Discount by Category",fontsize=14, fontweight='bold')
plt.xticks([p + bar_width for p in positions], Category_summary['Category'], rotation=45)
plt.legend()
# Show plot
plt.show()
Top Sales Category: Snacks have the highest total sales (~2.23M), followed closely by Bakery and Food Grains.¶
Profitability Variation: Although sales are high across categories, profit margins vary, with Bakery and Snacks leading in profits.¶
Zero Sales & Profit: The "Eggs" category shows zero sales and profit, possibly indicating missing data or lack of sales.¶
Balanced Performance: Categories like Fruits & Veggies and Oil & Masala have steady sales and moderate profits.¶
In [ ]:
In [ ]:
Sub Category by sales and profit¶
In [ ]:
In [ ]:
In [36]:
# Group data by 'Sub Category' and sum 'Sales' and 'Profit'
Sub_Category_summary = df.groupby('Sub Category').agg({'Sales': 'sum', 'Profit': 'sum','Discount':'mean'}).reset_index()
#rename the column
Sub_Category_summary.columns=['Sub Category','Total Sales','Total Profit','Total Discount']
Sub_Category_summary.head(4)
Out[36]:
| Sub Category | Total Sales | Total Profit | Total Discount | |
|---|---|---|---|---|
| 0 | Atta & Flour | 534649.0 | 127861.10 | 0.227875 |
| 1 | Biscuits | 684083.0 | 169357.62 | 0.225033 |
| 2 | Breads & Buns | 742586.0 | 190764.98 | 0.226494 |
| 3 | Cakes | 685612.0 | 168398.46 | 0.224646 |
In [37]:
sub_category_summary = df.groupby("Sub Category")[["Sales", "Profit"]].sum().reset_index()
# Plot the bar chart
fig, ax = plt.subplots(figsize=(10, 6))
sub_category_summary.plot(kind="line", x="Sub Category", y=["Sales", "Profit"] ,ax=ax, color=["blue", "green"])
# Customize the plot
plt.title("Total Sales and Profit by sub Category")
plt.ylabel("Amount (in currency units)")
plt.xlabel("Sub Category")
plt.xticks(rotation=45)
plt.legend(["Sales", "Profit"])
plt.grid(axis="y", linestyle="--", alpha=0.7)
# Show the plot
plt.show
Out[37]:
<function matplotlib.pyplot.show(close=None, block=None)>
Sales Peaks: Certain subcategories, such as "Health Drinks" and "Organic Vegetables," have significantly high sales compared to others.¶
Profit Trends: Although sales fluctuate, profit remains relatively stable across subcategories, with minor variations.¶
High Sales, Low Profit: Some subcategories show high sales but comparatively lower profits, indicating possible high costs or discounts.¶
Growth Opportunities: Categories with lower profits despite high sales may need pricing or cost optimization for better profitability¶
In [ ]:
In [ ]:
In [ ]:
In [ ]:
City by sale and profit¶
In [38]:
# Group data by city and calculate total sales and profit
City_summary = df.groupby("City")[["Sales", "Profit"]].sum().reset_index()
# Plot the scatter chart
fig, ax = plt.subplots(figsize=(12, 6))
# Scatter plot for Sales
ax.scatter(City_summary["City"], City_summary["Sales"], color="blue", label="Sales")
# Scatter plot for Profit
ax.scatter(City_summary["City"], City_summary["Profit"], color="green", label="Profit")
# Customize the plot
plt.title("Total Sales and Profit by City")
plt.ylabel("Amount (in currency units)")
plt.xlabel("City")
plt.xticks(rotation=45)
plt.legend()
plt.grid(axis="y", linestyle="--", alpha=0.7)
# Show the plot
plt.show()
In [ ]:
In [ ]:
In [ ]:
In [39]:
print(df.columns)
Index(['Order ID', 'Customer Name', 'Category', 'Sub Category', 'City',
'Region', 'Sales', 'Discount', 'Profit', 'State', 'Order Year',
'Order month', 'Order day'],
dtype='object')
In [40]:
df['Order month'].unique()
Out[40]:
array([11, 6, 10, 4, 12, 5, 8, 7, 1, 9, 3, 2])
Order year by sales , profit and Discount¶
In [41]:
# Group by Order Year and sum Sales, Profit, and Discount
yearly_summary = df.groupby("Order Year")[["Sales", "Profit", "Discount"]].sum().reset_index()
# Create Pie Charts
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
# Sales Pie Chart
axes[0].pie(yearly_summary["Sales"], labels=yearly_summary["Order Year"], autopct='%1.1f%%', colors=['blue', 'lightblue'])
axes[0].set_title("Sales by Year")
# Profit Pie Chart
axes[1].pie(yearly_summary["Profit"], labels=yearly_summary["Order Year"], autopct='%1.1f%%', colors=['green', 'lightgreen'])
axes[1].set_title("Profit by Year")
# Discount Pie Chart
axes[2].pie(yearly_summary["Discount"], labels=yearly_summary["Order Year"], autopct='%1.1f%%', colors=['red', 'lightcoral'])
axes[2].set_title("Discount by Year")
# Show Plot
plt.tight_layout()
plt.show()
Sales, profit, and discounts all show a consistent increase over the years, with 2018 contributing the highest share in all three categories.¶
The year 2017 also had a significant share, indicating strong business performance before reaching its peak in 2018.¶
The lowest share is observed in 2015 across all metrics, suggesting a steady growth trend in sales, profit, and discount allocation over time.¶
In [ ]:
sales and profit by order year¶
In [42]:
import pandas as pd
import plotly.express as px
# Grouping data by Order Year
yearly_summary = df.groupby("Order Year")[["Sales", "Profit"]].sum().reset_index()
# Create Bar Chart
fig = px.bar(
yearly_summary,
x="Order Year",
y=["Sales", "Profit"],
barmode="group", # Group bars side by side
title="Total Sales and Profit by Order Year",
labels={"value": "Amount (in currency units)", "Order Year": "Year"},
color_discrete_map={"Sales": "blue", "Profit": "green"}
)
# Show figure
fig.show()
In [ ]:
In [ ]:
Sales and profit by order month¶
In [43]:
# Group data by Order Month
Order_month_summary = df.groupby("Order month")[["Sales", "Profit"]].sum().reset_index()
# Plot the line chart
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(Order_month_summary["Order month"], Order_month_summary["Sales"], marker='o', linestyle='-', color="blue", label="Sales")
ax.plot(Order_month_summary["Order month"], Order_month_summary["Profit"], marker='o', linestyle='-', color="green", label="Profit")
# Customize the plot
plt.title("Total Sales and Profit by Order Month")
plt.ylabel("Amount (in currency units)")
plt.xlabel("Order Month")
plt.xticks(Order_month_summary["Order month"]) # Ensure proper month ticks
plt.legend()
plt.grid(axis="y", linestyle="--", alpha=0.7)
# Show the plot
plt.show()
In [ ]:
In [ ]:
In [ ]:
sales and profit by order day¶
In [44]:
# Group data by Order day
Order_day_summary = df.groupby("Order day")[["Sales", "Profit"]].sum().reset_index()
# Plot the line chart
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(Order_day_summary["Order day"], Order_day_summary["Sales"], marker='o', linestyle='-', color="blue", label="Sales")
ax.plot(Order_day_summary["Order day"], Order_day_summary["Profit"], marker='o', linestyle='-', color="green", label="Profit")
# Customize the plot
plt.title("Total Sales and Profit by Order day")
plt.ylabel("Amount (in currency units)")
plt.xlabel("Order day")
plt.xticks(Order_month_summary["Order day"]) # Ensure proper month ticks
plt.legend()
plt.grid(axis="y", linestyle="--", alpha=0.7)
# Show the plot
plt.show()
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) File C:\ProgramData\anaconda3\Lib\site-packages\pandas\core\indexes\base.py:3805, in Index.get_loc(self, key) 3804 try: -> 3805 return self._engine.get_loc(casted_key) 3806 except KeyError as err: File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc() File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc() File pandas\\_libs\\hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item() File pandas\\_libs\\hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item() KeyError: 'Order day' The above exception was the direct cause of the following exception: KeyError Traceback (most recent call last) Cell In[44], line 13 11 plt.ylabel("Amount (in currency units)") 12 plt.xlabel("Order day") ---> 13 plt.xticks(Order_month_summary["Order day"]) # Ensure proper month ticks 14 plt.legend() 15 plt.grid(axis="y", linestyle="--", alpha=0.7) File C:\ProgramData\anaconda3\Lib\site-packages\pandas\core\frame.py:4102, in DataFrame.__getitem__(self, key) 4100 if self.columns.nlevels > 1: 4101 return self._getitem_multilevel(key) -> 4102 indexer = self.columns.get_loc(key) 4103 if is_integer(indexer): 4104 indexer = [indexer] File C:\ProgramData\anaconda3\Lib\site-packages\pandas\core\indexes\base.py:3812, in Index.get_loc(self, key) 3807 if isinstance(casted_key, slice) or ( 3808 isinstance(casted_key, abc.Iterable) 3809 and any(isinstance(x, slice) for x in casted_key) 3810 ): 3811 raise InvalidIndexError(key) -> 3812 raise KeyError(key) from err 3813 except TypeError: 3814 # If we have a listlike key, _check_indexing_error will raise 3815 # InvalidIndexError. Otherwise we fall through and re-raise 3816 # the TypeError. 3817 self._check_indexing_error(key) KeyError: 'Order day'
In [ ]: